A Large Self-Annotated Corpus for Sarcasm

نویسندگان

  • Mikhail Khodak
  • Nikunj Saunshi
  • Kiran Vodrahalli
چکیده

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements — 10 times more than any previous dataset — and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated — sarcasm is labeled by the author, not an independent annotator — and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Putting Sarcasm Detection into Context: The Effects of Class Imbalance and Manual Labelling on Supervised Machine Classification of Twitter Conversations

Sarcasm can radically alter or invert a phrase’s meaning. Sarcasm detection can therefore help improve natural language processing (NLP) tasks. The majority of prior research has modeled sarcasm detection as classification, with two important limitations: 1. Balanced datasets, when sarcasm is actually rather rare. 2. Using Twitter users’ self-declarations in the form of hashtags to label data, ...

متن کامل

"sure, I Did the Right Thing": a System for Sarcasm Detection in Speech

While a fair amount of work has been done on automatically detecting emotion in human speech, there has been little research on sarcasm detection. Although sarcastic speech acts are inherently subjective, humans have relatively clear intuitions as to what constitutes sarcastic speech. In this paper, we present a system for automatic sarcasm detection. Using a new acted speech corpus that is ann...

متن کامل

Multilevel Annotation of Agreement and Disagreement in Italian News Blogs

In this paper, we present a corpus of news blog conversations in Italian annotated with gold standard agreement/disagreement relations at message and sentence levels. This is the first resource of this kind in Italian. From the analysis of ADRs at the two levels emerged that agreement annotated at message level is consistent and generally reflected at sentence level, and that the structure of d...

متن کامل

Irony and Sarcasm: Corpus Generation and Analysis Using Crowdsourcing

The ability to reliably identify sarcasm and irony in text can improve the performance of many Natural Language Processing (NLP) systems including summarization, sentiment analysis, etc. The existing sarcasm detection systems have focused on identifying sarcasm on a sentence level or for a specific phrase. However, often it is impossible to identify a sentence containing sarcasm without knowing...

متن کامل

A Corpus for Research on Deliberation and Debate

Deliberative, argumentative discourse is an important component of opinion formation, belief revision, and knowledge discovery; it is a cornerstone of modern civil society. Argumentation is productively studied in branches ranging from theoretical artificial intelligence to political rhetoric, but empirical analysis has suffered from a lack of freely available, unscripted argumentative dialogs....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1704.05579  شماره 

صفحات  -

تاریخ انتشار 2017